Tutorial of MendelKinship

last update: 10/18/2018

Julia version

Current code supports Julia version 0.6.4, and with luck it may work in v0.7.

In [1]:
versioninfo()
Julia Version 0.6.4
Commit 9d11f62bcb (2018-07-09 19:09 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

Plotting package versions

The package PlotlyJS must be on version 10.2 and Blink must be on version 0.6.2. One can check package versions by doing Pkg.status("Blink"). If your version number exceeds these, you can tell julia to use an older version by pinning them as follows:

In [2]:
Pkg.status("Blink")
 - Blink                         0.7.0
In [3]:
Pkg.pin("Blink", v"0.6.2")
Pkg.pin("PlotlyJS", v"0.10.2")
Pkg.build("Blink")
INFO: Creating Blink branch pinned.97e9ab8e.tmp
INFO: Removing FunctionalCollections v0.3.2
INFO: Removing JSExpr v0.2.2
INFO: Removing Observables v0.1.2
INFO: Removing WebIO v0.2.8
INFO: Removing Widgets v0.2.5
INFO: Creating PlotlyJS branch pinned.4f0c99ae.tmp
INFO: No packages to install, update or remove
INFO: Building HttpParser
INFO: Building MbedTLS
Info: using prebuilt binaries
INFO: Building Blink

When to use MendelKinship

MendelKinship.jl is capable of calculating the theoretical kinship coefficient $\Phi_{ij}$ as long as a valid pedigree structure is provided. When SNP markers are available, MendelKinship.jl can also calculate empirical kinship coefficients using the MoM or GRM methods (see this paper for details). Here we recommend the MoM method because its estimates are more robust in the presence of rare alleles.

MendelKinship can optionally compare the empirical kinship and theoretical kinships to check suspect pedigree structures and reveal hidden relatedness. It can also reveal sample mixed ups or other laboratory errors that can lead to inaccurate empirical kinships. The result is saved in a table sorted in descending order. We optioanlly output 2 interactive plots that allow users to quickly pinpoint pairs with the greatest theoretical vs empirical deviance.

Data used in Examples

The input for all examples in this tutorial can be obtained from the free application Mendel v16 option 29a. These data were obtained from the 1000 genome project, containing 85 people and 253141 SNPs, half of which have $maf < 0.05$. Using these founders' genotype, we simulated 127 extra people, resulting in 27 pedigrees and 212 people. Although the 85 individuals are treated as founders, they were actually somewhat related, and this is reflected in the kinship comparison in the 2nd example below. For more information on this dataset, please see Mendel's documentation example 29.4.

MendelKinship additionally accepts PLINK binary format as input, in which case the triplets (data.bim, data.bed, data.fam) must all be present. In this tutorial, there are no examples that uses these to import pedigree and SNP information. But if available, one can import the data by specifying the following in the control file:

plink_input_basename = data

However, sometimes the .fam file contains non-unique person id (2nd column of .fam file) across different pedigrees, which is currently not permitted. A person's id cannot be repeated in other pedigrees, even if it is contextually clear that they are different persons. This will be fixed in the near future.

Analysis keywords available to users

Keyword Default Value Allowed value Description
kinship_output_file Kinship_Output_File.txt true/false OpenMendel generated output file with table of kinship coefficients
repetitions 1 Integer Repetitions for sharing statistics
xlinked_analysis false beelean Whether markers are on the X chromosome
compare_kinships false boolean Whether we want to compare theoretical vs empiric kinship
kinship_plot "" User defined file name A user specified name for a plot comparing theoretical and empiric kinship value
z_score_plot "" User defined file name A user specified name for a plot of fisher's z statistic.
grm_method MoM GRM, MoM Method used for empiric kinship calculation. Defaults to MoM (default), but user could choose the more common GRM method instead. (Warning: Based on our experience, Fisher's z score is very unreliable if the GRM method is used for rare (maf < 0.2) snps)
maf_threshold 0.01 Real number between 0 and 1 The minor allele frequency threshold for the GRM computation
deviant_pairs false Integer less than $n(n+1)/2$ Number of top deviant pairs (theoretical vs empiric kinship) the user wants to keep

A list of OpenMendel keywords common to most analysis package can be found here

Example 1: Theoretical Kinship Coefficient Calculation

Step 1: Preparing the pedigree files:

Recall what is a valid pedigree structure. Note that we require a header line. The extension .in have no particular meaning. Let's examine (the first few lines of) such an example:

In [4]:
;head -10 "Ped29a.in"
Pedigree,Person,Mother,Father,Sex,,,simTrait
  1       ,  16      ,          ,          ,  F       ,          ,  29.20564,
  1       ,  8228    ,          ,          ,  F       ,          ,  31.80179,
  1       ,  17008   ,          ,          ,  M       ,          ,  37.82143,
  1       ,  9218    ,  17008   ,  16      ,  M       ,          ,  35.08036,
  1       ,  3226    ,  9218    ,  8228    ,  F       ,          ,  28.32902,
  2       ,  29      ,          ,          ,  F       ,          ,  36.17929,
  2       ,  2294    ,          ,          ,  M       ,          ,  42.88099,
  2       ,  3416    ,          ,          ,  M       ,          ,  40.98316,
  2       ,  17893   ,  2294    ,  29      ,  F       ,          ,  35.55038,

Step 2: Preparing the control file

A control file gives specific instructions to MendelKinship. To perform theoretical kinship calculation, an minimal control file looks like the following:

In [5]:
;cat "control_just_theoretical_29a.txt"
#
# Input and Output files.
#
pedigree_file = Ped29a.in
#
# Analysis parameters for Kinship option.
#
kinship_output_file = just_theoretical_output.txt

Step 3: Run the analysis in Julia REPL or directly in notebook

We used the package Suppressor to hide warnings. They will be removed when we update MendelKinship to Julia version 1.0. However often informative warnings and/or MendelKinship messages will be printed, so it is best practice for new users to at least review the messages.

In [6]:
using MendelKinship, CSV
using Suppressor # used to hide warnings
@suppress begin
    Kinship("control_just_theoretical_29a.txt")
end
INFO: Precompiling module PlotlyJS.
INFO: Loading HttpServer methods...

Plotly javascript loaded.

To load again call

init_notebook(true)

INFO: Precompiling module Suppressor.
PedigreePerson1Person2KinshipDelta7delta1delta2delta3delta4delta5delta6delta7delta8delta9
1116160.51.00.00.00.00.00.00.01.00.00.0
2116170080.00.00.00.00.00.00.00.00.00.01.0
311632260.1250.00.00.00.00.00.00.00.01.00.0
411682280.00.00.00.00.00.00.00.00.00.01.0
511692180.250.00.00.00.00.00.00.00.01.00.0
6117008170080.51.00.00.00.00.00.00.01.00.00.0
711700832260.1250.00.00.00.00.00.00.00.00.01.0
811700892180.250.00.00.00.00.00.00.00.01.00.0
91322632260.51.00.00.00.00.00.00.01.00.00.0
1018228170080.00.00.00.00.00.00.00.00.00.01.0
111822832260.250.00.00.00.00.00.00.00.01.00.0
121822882280.51.00.00.00.00.00.00.01.00.00.0
131822892180.00.00.00.00.00.00.00.00.00.01.0
141921832260.250.00.00.00.00.00.00.00.01.00.0
151921892180.51.00.00.00.00.00.00.01.00.00.0
16214695146950.51.00.00.00.00.00.00.01.00.00.0
17214695178930.250.250.00.00.00.00.00.00.01.00.0
1821469569520.1250.00.00.00.00.00.00.00.00.01.0
19217893178930.51.00.00.00.00.00.00.01.00.00.0
2021789369520.250.00.00.00.00.00.00.00.01.00.0
2122294146950.250.00.00.00.00.00.00.00.01.00.0
2222294178930.250.00.00.00.00.00.00.00.01.00.0
232229422940.51.00.00.00.00.00.00.01.00.00.0
242229434160.00.00.00.00.00.00.00.00.00.01.0
252229439160.250.00.00.00.00.00.00.00.01.00.0
262229467900.250.00.00.00.00.00.00.00.01.00.0
272229469520.1250.00.00.00.00.00.00.00.01.00.0
28229146950.250.00.00.00.00.00.00.00.01.00.0
29229178930.250.00.00.00.00.00.00.00.01.00.0
3022922940.00.00.00.00.00.00.00.00.00.01.0

Step 4: Interpreting the result

MendelKinship should have generated the filejust_theoretical_output.txt in your local directory. One can directly open the file, or import into the Julia environment for ease of manipulation using the DataFrames package. The fourth column contains the desired theoretical kinship coefficient. The 5th column contains the (deterministically) estimated Delta7 matrix. The 6th through the 14 columns contain the (stochastically) estimated Jacquard's 9 identity coefficients.

Example 2: Compare theoretical/empirical kinship values

When both pedigree structure and complete SNP information are available, we can compare theoretical/empirical kinship coefficients. In practice, however, we often have individuals without genotype information, but nevertheless must be included in the pedigree structure. MendelKinship does not handle this situation yet, but an analysis option that supports these data is being developed. For now you can impute genotypes but keep in mind that the relationship comparison for these individuals who lack all genotype information will not be meaningful.

Step 1: Prepare pedigree file and SNP data file

The pedigree file is the same as the pedigree file in the previous example. The SNP definition file requires a header row, and should have approprietely placed commas. It may be informative to compare the following SNP definition file with the original "SNP_def29a.in" in Mendel Option 29a.

In [20]:
;head -10 "SNP_def29a_converted.txt"

The SNP data files in this case must be stored in PLINK BED file in SNP-major format, with an accompanying SNP definition file. For an explanation of what these are, see MendelBase documentation.

If your have "data.bim", "data.bed", "data.fam" (i.e. the 3 triplet of PLINK files), then you can replace the 3 fields snpdata_file, snpdefinition_file, and pedigree_file in the next step with just 1 field:

plink_input_basename = data.

Step 2: Preparing control file

The following control file tells MendelKinship to compare theoretical kinship and empirical kinship, and output 2 interactive plots stored in .html format.

In [21]:
;cat "control_compare_29a.txt"

Step 3: Running the analysis

In [22]:
using MendelKinship, CSV
using Suppressor # used to hide warnings
@suppress begin
    Kinship("control_compare_29a.txt")
end
Pedigree1Pedigree2Person1Person2theoretical_kinshipempiric_kinshipfishers_zscore
11414267322640.00.1095525.31942
2313115884197700.250.150364-4.2355
3232399433920.1250.0225133-4.20155
4251422041166360.00.09697154.75137
5252511822241920.250.159229-3.82975
61414257322640.1250.2166224.62515
72525301230160.1250.2138884.4971
8251723404120040.0-0.0896437-3.60943
9252323404192790.0-0.0877967-3.52627
101714268572640.00.08599534.25691
1110040823452260.0-0.0859709-3.44408
12231974383440.0-0.0856543-3.42983
131919223757200.1250.0405078-3.39689
142523234047430.0-0.0831952-3.3192
15100408234132340.0-0.0805884-3.20196
162383121190990.0-0.0792375-3.14122
1710040823473950.0-0.079153-3.13743
18100408234190990.0-0.0789209-3.12699
191717908104180.250.174596-3.12361
2031231977037600.0-0.0783826-3.1028
213131152197700.1250.0472729-3.0941
22141425732257320.50.556283.89575
23100401123476700.0-0.0772533-3.05204
24251724192120040.0-0.0769367-3.03781
25251723404191130.0-0.0769156-3.03687
262319743236990.0-0.0768523-3.03402
271781200452260.0-0.0767573-3.02975
282319376083440.0-0.0763879-3.01315
291002917434120040.0-0.0760713-2.99893
30100402523430120.0-0.0760713-2.99893

Step 4: Interpreting the Result

Running the previous control file should have generated:

  • 1 table containing all the pairwise kinship and theoretical comparisons (kinship_file_output.txt)
  • 2 interactive plots stored in .html format

Generated Table:

The table is sorted in descending order of the largest deviance between the theoretical and empiric kinship. The last column lists the Fisher's Z statistic (i.e. the number of standard deviations away from mean).

Interesting remark: In the first row, person 26732 and 264 have 0 theoretical kinship (i.e. they are both founders) but their empirical kinship is pretty close to 0.125 = 1/8. That is, we discovered that these 2 people which we initially thought are unrelated, may be half siblings, grandparent-grandchild, or an avuncular pair. There may also have been a sample mix up. Another explanation is that we are only using one chromosome's worth of data and so the estimates of kinship may be imprecise.

Generated Plots:

The code below are not necessary for the user to understand. The 2 plots should automatically be generated in your directory. Unfortunately the plots cannot be imported into the notebook (because they are stored as .html, which is needed to enable the interactive sessions). So the code that generated the plots is pasted here to regerate the plots.

In [23]:
using PlotlyJS

result = CSV.read("kinship_file_output.txt")
name = Vector{String}(size(result, 1))
for i in 1:length(name)
    name[i] = "Person1=" * string(result[i, 3]) * ", " * "Person2=" * string(result[i, 4])
end

function linescatter()
    trace1 = scatter(;x=result[:theoretical_kinship], 
        y=result[:empiric_kinship], mode="markers", 
        name="empiric kinship", text=name)
    
    trace2 = scatter(;x=[1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 0.0],
        y=[1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 0.0], 
        mode="markers", name="marker for midpoint")
        
    layout = Layout(;title="Compare empiric vs theoretical kinship",
        hovermode="closest", 
        xaxis=attr(title="Theoretical kinship (θ)", showgrid=false, zeroline=false),
        yaxis=attr(title="Empiric Kinship", zeroline=false))
    
    data = [trace1, trace2]
    plot(data, layout)
end
linescatter()
Out[23]:

Explanation of previous plot

This interactive plot allows user to quickly identify which pairs of persons have an empirical kinship most deviated from their expected (theoretical) kinship. The midpoint is placed as an orange dot for interpretability. As an example, the first row in the table above is the highest point on the left most spread. Careful readers might observe that there is a wider spread on those with 0 expected theoretical kinship. This is expected, because most people are not related to each other, so we are making many more comparisons that have 0 expected kinship.

In [24]:
function two_hists()
    trace1 = histogram(x=result[:fishers_zscore], text=name)
    data = [trace1]
    
    layout = Layout(barmode="overlay", 
        title="Z-score plot for Fisher's statistic",
        xaxis=attr(title="Standard deviations"),
        yaxis=attr(title="count"))
    
    plot(data, layout)
end
two_hists()
Out[24]:

Explanation of Plot 2:

Fisher's z transform should give us samples from a standard normal distribution $N(0, 1)$. We ploted this statistic in plot 2, and at first glance, the distribution is approximately normal. In Julia, we can easily verify this by computing some summary statistics:

Compute mean and variance

In [25]:
my_zscore = convert(Vector{Float64}, result[:fishers_zscore])
mean(my_zscore), var(my_zscore)
Out[25]:
(-2.0141170647819297e-17, 0.9999999999999999)

Compute skewness and excess kurtosis

In [26]:
using StatsBase
skewness(my_zscore), kurtosis(my_zscore)
Out[26]:
(0.09239764218355156, 0.1065722273622467)